Small area estimates of vote shares: the 2022 Australian federal election
We provide estimates of vote shares at the SA1 level from the 2022 Australian federal election, making use of a mapping from polling places to SA1s provided by the Australian Electoral Commission and methods for solving inverse problems from statistics, computer science and the social sciences.
Political parties, candidates and analysts seek intelligence about voting intentions for small geographic units, so as to efficiently allocate campaign resources and to better understand the predictors of vote choices. This involves using data observed at relatively coarse levels of spatial resolution (e.g., electoral divisions, polling places) to make inferences with respect to component smaller areal units (e.g., postcodes, neighbourhoods), a long-standing problem in statistics and the social sciences.
SA1 or “Statistical Area 1” is the most granular unit of geography for which the Australian Bureau of Statistics (ABS) provides tabulations of demographic and social characteristics, with a median adult citizen population of just 263 in the 2021 Census.
After each federal election, the AEC produces a data file of voter turnout counts for each polling place utilized by voters in each SA1. The AEC also provides vote tallies for each polling place.
Even with the correspondence between polling places and SA1s, using polling place vote tallies to infer the distribution of votes by party at the SA1 level lacks a unique solution (a case of an ill-posed inverse problem). This is seldom acknowledged by practitioners who typically employ a simple, deterministic algorithm for estimating SA1 vote distributions, presented in Section 6.1.
We assess different approaches for generating plausible estimates of SA1 level vote shares, comparing their utility to practitioners and analysts. Our preferred approach draws on hierarchical Bayesian modeling, utilizing Census information about the social and demographic composition of SA1s to supplement the polling place vote tallies and the polling place/SA1 correspondence.
The methodology we survey and utilize here has many applications in policy settings such as health, transport, education and public safety, and in commercial settings such as marketing and property development.
1 The problem, briefly stated
We observe counts of votes cast for each party/candidate at each polling place (\(\boldsymbol{y}\)). Separately, we have data providing turnout counts at each polling place disaggregated to small spatial units (SA1s), which we encode in a matrix \(\boldsymbol{A} = \{ A_{ij} \}\) with \(A_{ij}\) the number of voters turning out at polling place \(i\) who reside in SA1 \(j\). What we want to recover is the SA1 level distribution of votes for each party, \(\boldsymbol{x}\).
Given \(\boldsymbol{y}\) and \(\boldsymbol{A}\) can we recover \(\boldsymbol{x}\)? Further, do social and demographic features of SA1s, \(\boldsymbol{z}\) help with the recovery of \(\boldsymbol{x}\)?
A graphical sketch of the model appears below; the node corresponding to the unobserved \(\boldsymbol{x}\) is displayed as a circle, while observed quantities are represented with squares.
2 Key terms: electoral divisions, polling places and Census geography
Voter enrolment and turnout are compulsory in Australia. Electoral authorities maintain a high quality list of enrolled voters and their current addresses, often relying on information from other government agencies to keep address information current.
Since 2019 Australia’s House of Representatives has comprised 151 single-member electoral divisions. The median number of voting locations (including postal voting) was 55 per division, with 139 locations used in the large and sparsely populated division of Grey in South Australia and just 34 in the division of Solomon centred on Darwin in the Northern Territory.
Electoral divisions sometimes span multiple local government areas and state legislative districts, but elections at these lower levels of government are not held concurrently with a Federal election.1 Accordingly, no great administrative complexity arises from not insisting that voters turn out at polling places close to their registered address. In short, Australian voters are not allocated to precincts within constituencies as is the case in the United States.
Consistent with making the legal obligation to turn out easy to fulfil, Australian voters have many options as to where and how or when to turn out. So, while many Australians opt to turn out at the polling place closest to their residence, this is far from universal. Consequently, vote shares reported for a given polling place are generated by a mix of voters from across the enclosing division.
3 Polling places draw on many SA1s
The AEC has produced a file that provides House of Representatives voter turnout counts from the 2022 election at each of the polling places utilised by the voters residing in each SA1 (the quantities denoted \(A_{ij}\), above).
8,449 unique pp_id (polling places) with non-zero vote counts
and 22,559,296 voters.
For each SA1 we compute the number of polling places utilised by its voters and the share of its total votes from each of the pp_id used by its voters. Figure 1 shows that the modal number of polling places used per SA1 is 17, with a small cluster of SA1s whose voters make use of just 1 polling place.
Figure 1: Histogram, number of polling places used per SA1
We examine the spread of SA1 votes across polling places in the following table: for each SA1 we order the contributing polling places by the number of votes from the SA1 cast at that polling place. We then compute the cumulative share of SA1 turnout from the largest/most-utilised polling place to the smallest/least-utilised polling place. After a cumulative \(n\) = 1, 2, \(\ldots\) polling places, we compute the median proportion of SA1 turnout and the 5th and 95th percentiles of the cumulative turnout across SA1s. These quantities are reported in the table below:
tab_spread = transpose(tab_ojs)
Inputs.table(tab_spread, {
  columns: ["i", "q50", "q05", "q95"],
  align: {i: "center", q50: "center", q05: "center", q95: "center"},
  format: {
    i: x => x.toFixed(0),
    q50: x => (100 * x).toFixed(1),
    q05: x => (100 * x).toFixed(1),
    q95: x => (100 * x).toFixed(1)
  },
  header: {
    i: "Cumulative polling places",
    q50: "Cumulative share of SA1 turnout (median)",
    q05: "5%",
    q95: "95%"
  }
})
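The cumulative-share calculation described above can be sketched in plain JavaScript on a toy turnout matrix. This is only an illustration under stated assumptions: the matrix `A` and the helper names `cumulativeShares` and `quantile` are invented here and are not the code used to produce the table, and the quantile rule (linear interpolation between order statistics) may differ from the one used for the reported figures.

```javascript
// Toy turnout matrix (invented): A[i][j] = voters at polling place i residing in SA1 j.
const A = [
  [60, 10, 0],
  [30, 50, 5],
  [10, 40, 95],
];

// For SA1 j: order its polling places from most to least used and
// return the running share of the SA1's total turnout.
function cumulativeShares(A, j) {
  const counts = A.map(row => row[j]).filter(c => c > 0).sort((a, b) => b - a);
  const total = counts.reduce((s, c) => s + c, 0);
  let running = 0;
  return counts.map(c => (running += c) / total);
}

// Empirical quantile with linear interpolation between order statistics.
function quantile(values, p) {
  const v = [...values].sort((a, b) => a - b);
  const pos = (v.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return v[lo] + (v[hi] - v[lo]) * (pos - lo);
}

const nSA1 = A[0].length;
const shares = Array.from({ length: nSA1 }, (_, j) => cumulativeShares(A, j));

// Share of each SA1's turnout captured by its single most-used polling place,
// and the median of that share across SA1s.
const afterOne = shares.map(s => s[0]);
const medianAfterOne = quantile(afterOne, 0.5);
console.log(afterOne, medianAfterOne);
```

In a fuller version, an SA1 that uses fewer than \(n\) polling places would be assigned a cumulative share of 1 at step \(n\) before quantiles are taken across SA1s.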
We can also examine the mapping from SA1s to polling places, producing the following table. Absentee votes, postal votes and other forms of declaration votes are bundled into one pseudo polling place for entire electoral divisions, which usually draw from almost all SA1s spanned by the division; we exclude votes cast in this fashion from the calculations in the table below, while including them in the analysis elsewhere in this report.
pp_sa1 = transpose(pp_sa1_ojs)
Inputs.table(pp_sa1, {
  columns: ["i", "q50", "q05", "q95"],
  align: {i: "center", q50: "center", q05: "center", q95: "center"},
  format: {
    i: x => x.toFixed(0),
    q50: x => (100 * x).toFixed(1),
    q05: x => (100 * x).toFixed(1),
    q95: x => (100 * x).toFixed(1)
  },
  header: {
    i: "Cumulative SA1s",
    q50: "Cumulative share of polling place turnout (median)",
    q05: "5%",
    q95: "95%"
  }
})
This inspection of the data reveals three features:
Typically, results from many polling places must be combined to generate estimates of vote shares at the SA1 level; conversely, many SA1s contribute to polling place level results.
While it is sometimes the case that one or two polling places account for the bulk of votes from a given SA1, this is relatively unusual. More often than not, even six polling places account for less than 90% of the turnout of a SA1, and typically about nine polling places are required to account for 95% of turnout in a SA1.
Conversely, about six or seven SA1s generally account for 50% of the votes cast at a given polling place, but we usually need fifty-seven SA1s to cover 95% of the votes cast at a polling place.
4 Estimating SA1 level vote shares
The mapping from SA1s to polling places is not especially sparse. Of the voters turning out at a polling place, usually only a small proportion come from a given SA1 and they usually constitute only a small proportion of the voters residing in a SA1. Thus, in general, vote counts observed at the polling place level will not supply much information about the vote shares of any particular SA1.
To make progress, we consider the following model of polling place vote tallies. With \(P\) parties/candidates, each polling place \(i \in 1, \ldots, n\) produces a vector of \(P\) vote counts \(y_i = (y_{i1}, \ldots, y_{iP})'\). SA1 \(j \in 1, \ldots, m\) has \(A_{ij}\) voters turning out at polling place \(i\), who contribute unobserved vote counts \(\zeta_{ij} = (\zeta_{ij1}, \zeta_{ij2}, \ldots, \zeta_{ijP})'\) to \(y_i\).
By construction, \(y_i\) is the element-wise sum of the vectors of vote counts \(\zeta_{ij}\), where \(j\) indexes the originating SA1s: \[
y_{i} = \sum_{j=1}^m \zeta_{ij}, \qquad \sum_{p=1}^P \zeta_{ijp} = A_{ij}
\tag{1}\]
where the summation over SA1s \(j = 1, \ldots, m\) is element-wise with respect to each of the \(P\) elements of \(\zeta_{ij}\).3
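To make Equation 1 concrete, the following sketch builds a toy set of \(\zeta_{ij}\) (which are unobserved in practice) and derives from them the two observed quantities, the turnout matrix and the polling place tallies. All numbers and variable names are invented for illustration.

```javascript
// zeta[i][j] = vote counts by party cast at polling place i by residents of SA1 j.
// Two polling places, three SA1s, two parties; values are invented.
const zeta = [
  [[30, 10], [5, 15], [0, 0]],
  [[10, 10], [20, 20], [25, 5]],
];
const P = zeta[0][0].length;

// Second part of Equation 1: A[i][j] is the sum over parties of zeta[i][j].
const A = zeta.map(row => row.map(cell => cell.reduce((s, c) => s + c, 0)));

// First part of Equation 1: y[i] is the element-wise sum over SA1s of zeta[i][j].
const y = zeta.map(row =>
  row.reduce((acc, cell) => acc.map((v, p) => v + cell[p]), new Array(P).fill(0))
);

// A is [[40, 20, 0], [20, 40, 30]] and y is [[35, 25], [55, 35]].
console.log(A, y);
```

The analyst sees only `A` and `y`; the estimation problem is to say something about `zeta` given those two aggregates.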
It is sometimes convenient to express the unobserved vote counts \(\zeta_{ij}\) in terms of unobserved proportions \(\lambda_{ij}\), where \(\zeta_{ij} = \lambda_{ij} \cdot A_{ij}\). In particular, we estimate SA1 level vote counts by summing over the vote shares (or counts) originating from SA1 \(j\) recorded at polling places \(i = 1, \ldots, n\): \[
x_j = \sum_{i=1}^n \zeta_{ij} = \sum_{i=1}^n \lambda_{ij} \cdot A_{ij}
\tag{2}\]
with the summation over the destination polling places \(i\) for voters from SA1 \(j\). SA1 vote proportions are then simply an element-wise, weighted average of the \(\lambda_{ij}\) computed over the destination polling places for SA1 \(j\): \[
(\pi_{j1}, \pi_{j2}, \ldots, \pi_{jP})' = \, \pi_j = \,
\frac{\sum_{i=1}^n \zeta_{ij}}{\sum_{i=1}^n A_{ij}}
\, = \,
\frac{\sum_{i=1}^n \lambda_{ij} \cdot A_{ij}}{\sum_{i=1}^n A_{ij}}.
\tag{3}\]
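A small sketch of the aggregation from polling-place segments to SA1 totals and proportions, assuming (counterfactually) that the segment counts \(\zeta_{ij}\) were known. The toy values and variable names are illustrative assumptions.

```javascript
// zeta[i][j] = vote counts by party at polling place i from SA1 j (invented values).
const zeta = [
  [[30, 10], [5, 15], [0, 0]],
  [[10, 10], [20, 20], [25, 5]],
];
const m = zeta[0].length;     // number of SA1s
const P = zeta[0][0].length;  // number of parties/candidates

// SA1-level vote counts: x[j] = sum over polling places i of zeta[i][j].
const x = Array.from({ length: m }, (_, j) =>
  Array.from({ length: P }, (_, p) => zeta.reduce((s, row) => s + row[j][p], 0))
);

// SA1-level vote proportions: divide x[j] by the SA1's total turnout.
// The sum of x[j] over parties equals the sum over i of A[i][j].
const pi = x.map(xj => {
  const turnout = xj.reduce((s, c) => s + c, 0);
  return xj.map(c => c / turnout);
});

console.log(x);  // SA1 0 has counts [40, 20]
console.log(pi); // SA1 0 has proportions [2/3, 1/3]
```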
Equation 1 highlights a key feature of the inferential task at hand. We have \(n \times P\) pieces of information in \(\boldsymbol{y}\), the \(n\) polling place level vectors of vote counts for the \(P\) parties/candidates. Our model expresses these as a weighted sum of the unobserved proportions \(\lambda_{ij}\), with weights \(A_{ij}\). We seek estimates of \(m > n\) SA1 level vectors of vote counts \(\boldsymbol{x}\), or \(m \times P\) unknown quantities. Equation 1 and Equation 2 show that the relationship between \(\boldsymbol{y}\) and \(\boldsymbol{x}\) is linear, but with \(m > n\) the implied system of linear equations is under-determined, with a set of possible \(\zeta_{ij}\) (or equivalently, \(\lambda_{ij}\)) that satisfy Equation 1 rather than a unique solution.
In the language of mathematics, computer science and engineering, our problem is one of trying to recover an unobserved input from an output: an “inverse problem”. Further, the inverse problem here is “ill-posed” in that the measurement process compresses or coarsens the SA1 level inputs to polling place level outputs, generating the analogue to an under-determined system of equations. The SA1-to-polling place turnout counts \(A_{ij}\) only supply information about the sum of the \(P\) elements of \(\zeta_{ij}\), reducing the solution space but leaving many configurations of SA1-level vote tallies consistent with the polling place tallies.
From the outset, therefore, we ought to concede that there is no unique set of \(m\) SA1 level vote tallies \(\boldsymbol{x}\) consistent with the \(n\) polling place tallies \(\boldsymbol{y}\). This is seldom acknowledged by practitioners who typically employ the simple, deterministic algorithm for estimating \(\boldsymbol{x}\) we study in Section 6.1. Our purpose here is to restate this overlooked point, explore feasible solutions and observe what these estimates (and the variation among them) imply for downstream inferences about political preferences.
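The lack of a unique solution can be seen in the smallest possible case: one polling place, two SA1s and two parties. The sketch below uses invented numbers; `isFeasible` is a hypothetical helper that checks the two constraints in Equation 1.

```javascript
// Observed data for a single polling place (invented values).
const y = [60, 40]; // votes for party 0 and party 1
const A = [50, 50]; // turnout from SA1 0 and SA1 1

// Two candidate allocations zeta[j][p] of the polling place's votes to SA1s.
const zetaA = [[30, 20], [30, 20]]; // both SA1s vote 60/40
const zetaB = [[50, 0], [10, 40]];  // SA1 0 votes 100/0, SA1 1 votes 20/80

// Feasible means: each SA1's counts sum to its turnout, and
// each party's counts sum to the observed tally.
function isFeasible(zeta, y, A) {
  const rowsOk = zeta.every((row, j) => row.reduce((s, c) => s + c, 0) === A[j]);
  const colsOk = y.every((yp, p) => zeta.reduce((s, row) => s + row[p], 0) === yp);
  return rowsOk && colsOk;
}

console.log(isFeasible(zetaA, y, A), isFeasible(zetaB, y, A)); // true true
```

Both allocations reproduce the observed data exactly, yet they imply very different SA1-level vote shares.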
5 Example: the electoral division of Wentworth
We consider an example from the 2022 election, the House of Representatives division of Wentworth in Sydney’s Eastern suburbs. This division was one of six “blue ribbon” seats where Liberal Party incumbents were defeated by independent candidates, and so was the subject of considerable campaign effort, analysis and media attention.
Wentworth encompasses some of Australia’s wealthiest neighbourhoods on the southern shore of Sydney Harbour to the east of the Sydney CBD (e.g., Darling Point, Double Bay, Point Piper, Rose Bay, Vaucluse, Watson’s Bay), but also includes less wealthy and more diverse neighbourhoods such as Bondi, Bondi Junction, Darlinghurst, Kings Cross, Randwick and Clovelly.
This variation is reflected in the vote shares across the various polling places in the table below, which displays the 57 polling place level vote shares as percentages, along with the total number of votes cast at each polling place; these data are freely available from the AEC. We collapse votes for One Nation and the Palmer United Australia Party into one “Other” group. “INF” are informal or “spoiled” ballots. The table can be sorted by clicking on the column headers.
y_wentworth = transpose(y_wentworth_ojs)
Inputs.table(y_wentworth, {
  rows: 1000,
  columns: ["pp_nm", "votes_total_pp", "ALP", "GRN", "IND", "LP", "OTH", "INF"],
  align: {pp_nm: "left"},
  format: {
    votes_total_pp: x => d3.format(",")(x),
    ALP: x => (100 * x).toFixed(1),
    GRN: x => (100 * x).toFixed(1),
    IND: x => (100 * x).toFixed(1),
    LP: x => (100 * x).toFixed(1),
    OTH: x => (100 * x).toFixed(1),
    INF: x => (100 * x).toFixed(1)
  },
  header: {pp_nm: "Polling Place", votes_total_pp: "Votes"}
})
Figure 2: Wentworth vote shares by polling place, 2022 House of Representatives election. Source: Australian Electoral Commission.
Wentworth spans 345 SA1s, depicted in the following map. Orange circles correspond to polling places geo-coded by the AEC; mobile or “roaming” polling places are excluded (e.g., Special Hospital Team 1) as are postal votes, declaration pre-poll votes and absentee votes. Rolling over each polling place will distinguish the smallest set of SA1s contributing three quarters of turnout at that polling place (darker shading) from other SA1s contributing turnout to the polling place (lighter shading); SA1s not contributing turnout to the polling place are not displayed.
This presentation reveals a reasonable degree of geographic concentration in Election Day, in-person turnout. Polling places occupying the same physical location (as a pre-poll voting centre and then as an Election Day, in-person polling place) are jittered so as to be visually distinct.
The allocation from SA1s to polling places is given by the elements of \(\boldsymbol{A}\), in this case a 57 (polling places) by 345 (SA1s) matrix. Of the 19,665 entries in \(\boldsymbol{A}\) only 8,349 or 42.5% are non-zero, since while many SA1s contribute to any given polling place, most usually do not.4
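The sparsity figures quoted above are simple to reproduce. The sketch below defines a hypothetical helper, `nonZeroShare`, applies it to an invented toy matrix, and then repeats the arithmetic for the Wentworth counts reported in the text.

```javascript
// Share of non-zero cells in a turnout matrix A (rows: polling places, columns: SA1s).
function nonZeroShare(A) {
  const cells = A.flat();
  return cells.filter(c => c !== 0).length / cells.length;
}

// Toy matrix, invented for illustration: 5 of its 6 cells are non-zero.
const toyA = [[40, 20, 0], [20, 40, 30]];
console.log(nonZeroShare(toyA));

// Arithmetic for the Wentworth figures quoted in the text.
const totalCells = 57 * 345;                  // polling places x SA1s
const pctNonZero = (100 * 8349) / totalCells; // non-zero cells as a percentage
console.log(totalCells, pctNonZero.toFixed(1)); // 19665 42.5
```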
The sparsity of the \(\boldsymbol{A}\) matrix for Wentworth is presented graphically below, with the filled cells in the grid corresponding to non-zero elements of \(\boldsymbol{A}\). Absentee voting, postal ballot and forms of early voting (e.g., pre-poll voting centres or PPVC) draw on almost every SA1 in Wentworth, with the sparsity arising from most Election Day, in-person turnout concentrating at polling places spatially proximate to the voter’s SA1.
Figure 3: Contributions of SA1s to polling places in the division of Wentworth in Australia’s 2022 elections for the House of Representatives. Source: Australian Electoral Commission SA1 file.
The information in Figure 2 and summarised in Figure 3 constitutes the data at hand for now; we introduce Census information at the SA1 level in section XXX.
We now consider two different estimation strategies with the information at hand:
A reverse strategy, working from the polling place vote shares \(\boldsymbol{y}\) backwards through the model to form estimates of \(\boldsymbol{x}\).
A forward model, treating the \(\lambda\) as unknown parameters in a statistical model for \(\boldsymbol{y}\). This approach opens the possibility of SA1-level covariates \(\boldsymbol{z}\) entering the model.
6 The reverse strategy
The reverse strategy is widely used in practice. We consider two approaches using this strategy, a deterministic estimator and its simulation or permutation-based analogue.
6.1 Deterministic estimator
Each \(\lambda_{ij}\) in Equation 2 is set to the vote proportions observed at its corresponding polling place \(i\), i.e.,
\[
\hat{\lambda}_{ij} = y_i / n_i
\tag{4}\]
where \(n_i = \sum_{j=1}^m A_{ij}\) is the number of voters at polling place \(i\). One rationalization for this estimator is that, absent any other information, the segment of voters at polling place \(i\) originating from SA1 \(j\) cannot be distinguished from any of the other voters at polling place \(i\). From the perspective of the analyst, it is as if the set of voters at polling place \(i\) from SA1 \(j\) comprises a random sample from the set of all voters turning out at polling place \(i\). All such segments are therefore exchangeable, and so we should assign any particular segment the same vote probabilities as any other segment, for which our best guess is simply the observed vote proportions at polling place \(i\) (Equation 4).
The estimator in Equation 4 (a) is trivially consistent with the observed data, as confirmed by substituting \(\hat{\lambda}_{ij}\) from Equation 4 into Equation 1; (b) relies on assumptions so seemingly uncontroversial that they appear to need no justification and are rarely even stated; and (c) is easily computed using spreadsheets.
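A minimal sketch of the deterministic estimator on toy data, including the consistency check in point (a). The inputs are invented and the variable names (`lambdaHat`, `xHat`, `yImplied`) are illustrative; the sketch assumes the row sums of `A` equal the polling place totals in `y`.

```javascript
// Toy inputs (invented): A[i][j] turnout from SA1 j at polling place i,
// y[i][p] votes for party p at polling place i.
const A = [[40, 20, 0], [20, 40, 30]];
const y = [[35, 25], [55, 35]];
const m = A[0].length;
const P = y[0].length;

// Equation 4: every segment at polling place i gets that place's vote shares.
const lambdaHat = y.map((yi, i) => {
  const ni = A[i].reduce((s, a) => s + a, 0); // n_i, turnout at polling place i
  return yi.map(v => v / ni);
});

// Estimated SA1 vote counts: xHat[j][p] = sum over i of lambdaHat[i][p] * A[i][j].
const xHat = Array.from({ length: m }, (_, j) =>
  Array.from({ length: P }, (_, p) =>
    A.reduce((s, row, i) => s + lambdaHat[i][p] * row[j], 0)
  )
);

// Consistency check: summing lambdaHat[i][p] * A[i][j] over SA1s j recovers y[i][p].
const yImplied = y.map((yi, i) =>
  yi.map((_, p) => A[i].reduce((s, a) => s + lambdaHat[i][p] * a, 0))
);

console.log(xHat, yImplied);
```

Because every segment at a polling place is assigned identical shares, any differences between SA1 estimates arise only from differences in which polling places their residents used.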
An elaboration of this estimator comes from noting subsets of the voters turning out at polling place \(i\) will almost surely have different vote shares than those of the entire set. That is, for any polling place \(i\), its SA1-specific segments \(\lambda_{ij}\) will almost surely vary around the polling place shares \(y_i/n_i\), a function of the political heterogeneity of the SA1s and the processes leading voters to turn out at particular polling places. These factors will induce more or less variation in \(\lambda_{ij}\).
So, while Equation 4 provides a valid estimate of SA1-level votes, in the sense of being consistent with the observed data, it almost certainly underestimates variation in the \(\lambda_{ij}\) within polling places and hence in the SA1 vote shares themselves. While recourse to the “principle of insufficient reason” leads to the deterministic estimator, any set of \(\lambda_{ij}\) that satisfies the equality in Equation 1 is as valid as any other.
In turn, this observation speaks to the “ill-posed” nature of this inferential problem: in general, there is no unique solution when trying to invert the aggregation of \(m\) SA1-level vote shares to \(n\) polling place sets of observations, when \(m > n\).
6.2 Permutation-based estimator
Simulation-based methods let us operationalise the variation of \(\lambda_{ij}\) around the polling place level vote shares \(y_i/n_i\). For polling place \(i\), each SA1 contributes \(A_{ij}\) voters, with an unknown set of vote proportions \(\lambda_{ij}\). A permutation-based method for simulating variation in the unobserved \(\lambda_{ij}\) is as follows:
1. Define a vector \(v_i = 1, \ldots, n_i\), indexing voters turning out at polling place \(i\); without loss of generality, partition this vector into bins of size \(y_i = (y_{i1}, y_{i2}, \ldots, y_{iP})'\), each \(y_{ip}\) corresponding to the number of votes for party/candidate \(p\) recorded at polling place \(i\).
2. For arbitrarily many iterations \(k = 1, \ldots, K\), randomly assign the elements of \(v_i\) to \(m\) partitions, with partition \(j\) of size \(A_{ij}\), where \(j \in 1, \ldots, m\) indexes the SA1s contributing votes to polling place \(i\). This procedure yields a random permutation of votes counted at polling place \(i\) into originating SA1s. By construction, each such random permutation is consistent with (a) the polling place vote tallies \(y_i\) and (b) the known sizes of the segments of voters by their originating SA1 (\(A_{ij}\)).
3. Compute \(\zeta^{(k)}_{ij} = (\zeta_{ij1}^{(k)}, \zeta_{ij2}^{(k)}, \ldots, \zeta_{ijP}^{(k)})'\), the counts of the \(v_i\) voting for party/candidate \(p = 1, \ldots, P\) among those \(v_i\) assigned to SA1 \(j \in 1, \ldots, m\) under permutation \(k\).
4. Recalling Equation 2, form the vectors of SA1-level vote counts \(\hat{x}^{(k)}_j = \sum_{i=1}^n \zeta^{(k)}_{ij}\).
5. Average over \(K\) iterations to form estimators \(\hat{\zeta}_{ij} = \frac{1}{K} \sum_{k=1}^K \zeta^{(k)}_{ij}\); we form 90% credible intervals by computing the 5th and 95th percentiles of each element of the \(K\) simulated \(\zeta^{(k)}_{ij}\). These estimated vote counts can also be reported as proportions via Equation 3.
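The steps above can be sketched as follows on toy data. This is an illustration under stated assumptions rather than the implementation behind the reported results: the inputs are invented, the helper names (`shuffle`, `onePermutation`, `summarise`) are hypothetical, `Math.random` is unseeded so results vary from run to run, the percentile rule is a crude order-statistic lookup, and summaries are taken over the SA1-level counts \(\hat{x}^{(k)}_j\) rather than over each \(\zeta^{(k)}_{ij}\).

```javascript
// Toy inputs (invented): A[i][j] turnout, y[i][p] tallies; row sums of A match sums of y[i].
const A = [[40, 20, 0], [20, 40, 30]];
const y = [[35, 25], [55, 35]];

// Fisher-Yates shuffle, in place.
function shuffle(arr) {
  for (let k = arr.length - 1; k > 0; k--) {
    const r = Math.floor(Math.random() * (k + 1));
    [arr[k], arr[r]] = [arr[r], arr[k]];
  }
  return arr;
}

// One permutation: returns x[j][p], SA1-level counts summed over polling places.
function onePermutation(A, y) {
  const m = A[0].length;
  const P = y[0].length;
  const x = Array.from({ length: m }, () => new Array(P).fill(0));
  A.forEach((row, i) => {
    // Step 1: one ballot per voter at polling place i, labelled by party index p.
    const ballots = y[i].flatMap((count, p) => new Array(count).fill(p));
    // Step 2: shuffle, then cut into consecutive segments of size A[i][j].
    shuffle(ballots);
    let start = 0;
    row.forEach((size, j) => {
      // Steps 3 and 4: count parties within the segment and add them to SA1 j.
      for (const p of ballots.slice(start, start + size)) x[j][p] += 1;
      start += size;
    });
  });
  return x;
}

const K = 2000;
const draws = Array.from({ length: K }, () => onePermutation(A, y));

// Step 5: mean and approximate 5th/95th percentiles of x[j][p] across the K draws.
function summarise(j, p) {
  const v = draws.map(d => d[j][p]).sort((a, b) => a - b);
  const mean = v.reduce((s, c) => s + c, 0) / K;
  return { mean, q05: v[Math.floor(0.05 * (K - 1))], q95: v[Math.ceil(0.95 * (K - 1))] };
}

console.log(summarise(0, 0));
```

Every draw respects both the polling place tallies and the segment sizes, so the spread across draws reflects only the uncertainty left open by those two constraints.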
6.3 Reverse estimation: application to Wentworth
Figure 4: Estimates of vote shares in Wentworth SA1s, 2022 Federal election, based on 50,000 permutations of votes to SA1s. Points correspond to estimated vote shares and horizontal lines cover 90% credible intervals. Estimates for SA1s with fewer than 10 voters are highly imprecise and are excluded from the graphs. Each panel corresponds to results with respect to the indicated party; the data are ordered within each panel; the vertical orange line on each panel indicates the division-wide vote share for the indicated party.
Stack the elements of \(\boldsymbol{y}\) as a column vector of length \(nP\): \[
y =
\left[
\begin{aligned}
y_{11} \\
y_{12} \\
\vdots \\
y_{1P} \\
y_{21} \\
\vdots \\
y_{nP}
\end{aligned}
\right]
\] and similarly stack the vectors \(\zeta_{ij}\), \(i \in 1, \ldots, n\) and \(j \in 1, \ldots, m\).
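As a small illustration of this stacking, with invented tallies and a hypothetical accessor `at`, flattening a polling-place-by-party array row by row gives the ordering shown above, with the party index varying fastest.

```javascript
// y[i][p]: votes for party p at polling place i (invented values; n = 2, P = 2).
const y = [[35, 25], [55, 35]];
const P = y[0].length;

// Row-major stacking: y_11, y_12, ..., y_1P, y_21, ..., y_nP.
const yStacked = y.flat();

// Element (i, p), zero-based, sits at position i * P + p of the stacked vector.
const at = (i, p) => yStacked[i * P + p];

console.log(yStacked, at(1, 0));
```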
The codes used in the AEC’s SA1 files are actually 7-digit SA1 identifiers from the 2016 Census. We map these to SA1s used in the 2021 Census using a correspondence file provided by the ABS.
In practice, for computational efficiency we define \(\zeta_{ij}\) and summations over them only for those \((i,j)\) pairs where \(A_{ij} > 0\).
We find discrepancies between polling place turnout in the vote tallies in AEC results and polling place turnout levels derived by summing the rows of \(\boldsymbol{A}\). For instance, in the AEC’s election results, total House of Representatives turnout in Wentworth is 91,200 while summing over the relevant entries in the AEC’s SA1 file produces a figure of 91,263. These discrepancies are typically on the order of a handful of votes at the polling place level and for 12 out of 57 polling places the two turnout counts match. We reconcile these discrepancies by (a) for each polling place, normalising the vote tallies for each party plus informal ballots to sum to the polling place turnout derived from the SA1 file, keeping the parties’ vote shares constant; (b) rounding the resulting tallies to the nearest integer; and (c) making any final adjustment needed for the two turnout counts to match by a one-vote increment or decrement to the count of informal ballots.